Day3 Reinforce part-1

2023 iThome Ironman Contest

DAY 3

AI & Data

AI Kitchen Symphony: A Multi-Agent Cooking Show, part 3 of the series

15th Ironman Contest

皮卡喵

2023-09-17 14:33:58


This chapter introduces Reinforce. For the other famous approach, the Q-function, you are welcome to visit my articles from a few years ago → Day17~Day22

Reinforce

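That's right, you read that correctly: it is simply called Reinforce, and it is a foundation of reinforcement learning. There is also a critic that tells our algorithm (the policy) how valuable its actions are. Let's run a toy case so everyone can experience the fun of reinforcement learning.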

Here we borrow the PyTorch examples for the walkthrough.

Let's look at line 35 of that example:

Policy

import torch.nn as nn
import torch.nn.functional as F


class Policy(nn.Module):
    """
    implements both actor and critic in one model
    """
    def __init__(self):
        super(Policy, self).__init__()
        self.affine1 = nn.Linear(4, 128)

        # actor's layer
        self.action_head = nn.Linear(128, 2)

        # critic's layer
        self.value_head = nn.Linear(128, 1)

        # action & reward buffer
        self.saved_actions = []
        self.rewards = []

Here we wrap the Reinforce network in a class named Policy. It contains three layers:

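1. affine1: data input
   A shared layer that first extracts features from the data, then feeds them to the two layers that follow (actor and critic).
2. actor network: action output
   The output the algorithm uses to control the agent, deciding on actions such as moving or shooting.
3. critic network: value estimation
   Estimates the value of the current state. Training uses this estimate as a baseline: an action whose return beats the estimate becomes more likely to be chosen.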
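
To make the data flow through these three layers concrete, here is a minimal sketch that builds them as standalone layers with the same sizes as in Policy and pushes one state through. The input values are made up for illustration, and the fixed seed is only there to make the random weights reproducible.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)  # fixed seed so the random weights are reproducible

# Same layer sizes as in Policy above
affine1 = nn.Linear(4, 128)      # shared layer: 4 input features -> 128 hidden features
action_head = nn.Linear(128, 2)  # actor: one raw score per action (2 actions)
value_head = nn.Linear(128, 1)   # critic: a single value estimate

# Hypothetical input: one made-up state with 4 features
state = torch.tensor([0.1, -0.2, 0.05, 0.3])

features = F.relu(affine1(state))      # shared features, computed once
action_scores = action_head(features)  # raw scores, not yet probabilities
state_value = value_head(features)     # critic's estimate for this state

print(features.shape)       # torch.Size([128])
print(action_scores.shape)  # torch.Size([2])
print(state_value.shape)    # torch.Size([1])
```

Both heads read the same `features` tensor, so the shared layer runs only once per state.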


    def forward(self, x):
        """
        forward of both actor and critic
        """
        x = F.relu(self.affine1(x))

        # actor: chooses action to take from state s_t
        # by returning probability of each action
        action_prob = F.softmax(self.action_head(x), dim=-1)

        # critic: evaluates being in the state s_t
        state_values = self.value_head(x)

        # return values for both actor and critic as a tuple of 2 values:
        # 1. a list with the probability of each action over the action space
        # 2. the value from state s_t
        return action_prob, state_values

forward is where the network's computation happens. You can see it produces two outputs, one from action_head and one from value_head:

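1. The output of action_head first passes through a softmax layer. This guarantees that the outputs are all non-negative and sum to 1, so they satisfy the conditions of a probability distribution. Imagine a game whose only movement controls are move forward, move backward, and do nothing. Example: P = p(move forward) + p(move backward) + p(do nothing) = 1
2. value_head outputs an estimate of how much reward we should expect to receive.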
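
To see the softmax property from the first point in numbers, here is a small sketch in plain Python (no PyTorch needed). The three action names follow the example above; the raw scores are made up for illustration, and `softmax` here is a hand-written helper, not the library function.

```python
import math
import random


def softmax(scores):
    """Turn arbitrary real-valued scores into probabilities."""
    # Subtracting the max does not change the result but avoids overflow in exp()
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical action names and raw scores, made up for illustration
actions = ["move forward", "move backward", "do nothing"]
scores = [2.0, -1.0, 0.5]

probs = softmax(scores)
for name, p in zip(actions, probs):
    print(f"p({name}) = {p:.3f}")
print("sum =", round(sum(probs), 6))

# Sample one action according to these probabilities
random.seed(0)
chosen = random.choices(actions, weights=probs, k=1)[0]
print("sampled action:", chosen)
```

Even the negative score ends up as a small positive probability, and an action with a higher score is sampled more often, but the others still get picked now and then.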